Going beyond the means: Exploring the role of bias from digital determinants of health in technologies

Background: In light of recent retrospective studies revealing evidence of disparities in access to medical technology and of bias in measurements, this narrative review assesses digital determinants of health (DDoH) in both technologies and medical formulae that demonstrate either evidence of bias or suboptimal performance, identifies potential mechanisms behind such bias, and proposes potential methods or avenues that can guide future efforts to address these disparities.

Approach: Mechanisms are broadly grouped into physical and biological biases (e.g., pulse oximetry, non-contact infrared thermometry [NCIT]), interaction of human factors and cultural practices (e.g., electroencephalography [EEG]), and interpretation bias (e.g., pulmonary function tests [PFT], optical coherence tomography [OCT], and Humphrey visual field [HVF] testing). The scope of this review specifically excludes technologies incorporating artificial intelligence and machine learning. For each technology, we identify both clinical and research recommendations.

Conclusions: Many of the DDoH mechanisms encountered in medical technologies and formulae result in lower accuracy or lower validity when applied to patients outside the initial scope of development or validation. Our clinical recommendations caution clinical users against completely trusting result validity and suggest correlating with other measurement modalities robust to the DDoH mechanism (e.g., arterial blood gas for pulse oximetry, core temperatures for NCIT). Our research recommendations suggest not only increasing diversity in development and validation, but also awareness of the modalities of diversity required (e.g., skin pigmentation for pulse oximetry but skin pigmentation and sex/hormonal variation for NCIT). By increasing diversity so that it better reflects patients in all scenarios of use, we can mitigate DDoH mechanisms and increase trust and validity in clinical practice and research.


Introduction
Novel medical technologies have arisen to assist clinical teams and facilitate diagnosis by physicians, especially under budget constraints: since 2010, 523 new medical devices have been approved for commercialization by the Food and Drug Administration (FDA). In parallel with this development, retrospective studies have revealed evidence of disparities in access to medical technology and of bias in the measurements resulting from such devices.
While previous articles have focused on social and economic determinants of health, this narrative review investigates digital determinants of health, as defined earlier in this collection [1]. Specifically, it identifies digital technologies and medical formulae that demonstrate evidence of bias or suboptimal performance. Such pitfalls generally arise from insufficient consideration of patient diversity. Herein, we describe some known physical or biological mechanisms underpinning differences among patients, including those based on sex (either current or at birth), race, and ethnicity, and identify ways in which these characteristics affect the accuracy of digital medical technology for some populations. One such example is pulse oximetry: disparities in its performance among racial groups are thought to result from a lack of patient diversity in clinical trials [2]. Another example is body temperature measurement: differential thermoregulation among females affects the estimates provided by some thermometers. Further, we explain possible repercussions of these biases on digital determinants of health, formulate potential reasons why inadequate patient sampling has resulted in such impacts, and derive implications for clinical care. Finally, when applicable, we present existing solutions to mitigate these biases or suggest ways that corrections may be developed. This review does not cover the impact of medical technologies relying on artificial intelligence and machine learning, such as algorithms for clinical decision-making. Indeed, insufficient diversity in patient sampling, commonly due to selection bias, inequitable decision-making, or systemic racism, has already been well documented as influencing the performance of these models [3]. This review also excludes the direct impacts of social determinants of health, such as the underdetection of diabetes among patients of color due to factors affecting their access to medical care, which is also a topic well documented in the literature.
Based upon the framework by Kadambi and colleagues [4], we organize this review as follows. First, we describe biases based on characteristics that patients were born with and that are immutable without active intervention, e.g., biological sex at birth or skin tone (Physical and biological bias section). Then, we discuss biases that can result from the confluence of medical technology and community-dependent cultural or social norms, e.g., the quality of electroencephalography (EEG) signals may vary with patient hairstyles (Interaction of human factors and cultural practices section). Finally, we consider biases resulting from design choice or interpretation, e.g., formulae for pulmonary function tests (PFTs) that include race as a factor, although valid alternatives excluding it have been proposed (Interpretation bias section).

Definitions
Some terms used to characterize the 3 types of bias listed above have inconsistent definitions. As a preamble, we define the terms skin tone, calibration, and discrimination.
This paper uses the terminology skin tone instead of skin color to describe the color of the skin, including concepts of melanin concentration along with jaundice and other skin pigments. We believe this framing choice is essential as the term is more nuanced and inclusive, including all gradients of skin pigmentation.
Moreover, we reference Alba and colleagues in definitions of calibration and discrimination [5]. Calibration refers to the accuracy of absolute estimates, effectively comparing empirical observations with predicted or estimated measurements (e.g., arterial blood gas oxygen saturation versus pulse oximetry) and improving their correlation by adjusting device settings. Discrimination refers to how well a model can differentiate between groups.
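To make the distinction concrete, the sketch below computes one simple calibration summary (the mean signed error between paired estimates and reference values) and one discrimination summary (the area under the ROC curve for identifying reference values below a threshold). The paired readings and the 88% threshold are hypothetical and chosen only for illustration; they are not taken from any study cited in this review.

```python
from typing import Sequence


def mean_bias(estimates: Sequence[float], references: Sequence[float]) -> float:
    """Calibration summary: average signed error (estimate minus reference)."""
    return sum(e - r for e, r in zip(estimates, references)) / len(estimates)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Discrimination summary: probability that a randomly chosen positive case
    has a higher score than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical paired readings in percent saturation (illustration only).
sao2 = [86.0, 87.0, 90.0, 93.0, 96.0, 98.0]  # reference: arterial blood gas
spo2 = [90.0, 90.0, 92.0, 95.0, 97.0, 99.0]  # estimate: pulse oximeter

bias = mean_bias(spo2, sao2)  # positive value = device overestimates

# Can a low device reading single out patients whose reference value is < 88%?
hypoxemic = [1 if s < 88.0 else 0 for s in sao2]
risk_score = [-s for s in spo2]  # lower SpO2 maps to a higher risk score
area = auc(risk_score, hypoxemic)

print(f"mean bias: {bias:.2f} points, AUC: {area:.2f}")
```

In this toy example the device ranks patients perfectly (good discrimination) while still overestimating saturation by about 2 points on average (poor calibration), which is why the two properties need to be assessed separately.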
We also use the standard terminology of verification, analytic validation, and clinical validation following the V3 Framework [6].

Physical and biological bias
The characteristics of a patient's skin can influence the performance of medical devices, as illustrated by BioMetric Monitoring Technologies (BioMeTs). Pulse oximetry and non-contact infrared thermometry (NCIT) provide 2 such examples.

Pulse oximetry
Pulse oximetry is a common technique for measuring peripheral oxygen saturation (SpO2). Physiologically, pulse oximeters compare absorption at 2 wavelengths of light to estimate the ratio between deoxyhemoglobin and oxyhemoglobin in arterial blood, thereby performing a simplified version of spectrometry [7].
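The two-wavelength comparison is often summarized as a "ratio of ratios" between the pulsatile (AC) and steady (DC) components of the red and infrared signals. The sketch below assumes the commonly quoted textbook linear approximation SpO2 ≈ 110 − 25R. Commercial devices rely on their own empirical calibration curves, so the coefficients, function names, and input values here are illustrative assumptions and not any manufacturer's method.

```python
def ratio_of_ratios(ac_red: float, dc_red: float, ac_ir: float, dc_ir: float) -> float:
    """R = (AC_red / DC_red) / (AC_ir / DC_ir), from red and infrared light signals."""
    if dc_red <= 0 or dc_ir <= 0 or ac_ir <= 0:
        raise ValueError("DC components and infrared AC component must be positive")
    return (ac_red / dc_red) / (ac_ir / dc_ir)


def spo2_linear_estimate(r: float, intercept: float = 110.0, slope: float = 25.0) -> float:
    """Textbook linear approximation (assumed here), clipped to the 0-100% range."""
    return max(0.0, min(100.0, intercept - slope * r))


# Hypothetical signal amplitudes in arbitrary units (illustration only).
r = ratio_of_ratios(ac_red=0.010, dc_red=1.0, ac_ir=0.020, dc_ir=1.0)
estimate = spo2_linear_estimate(r)
print(f"R = {r:.2f}, estimated SpO2 = {estimate:.1f}%")
```

Because the mapping from R to saturation is fitted empirically, it can only be as representative as the cohort on which it was fitted.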
Differential performance of pulse oximetry across patient subpopulations has been known for over 4 decades [8-12] but has recently been brought back to the forefront by large-scale data analyses from Sjoding and colleagues, Wong and colleagues, and Henry and colleagues [13-15], which suggest persisting racial-ethnic disparities in oxygen readings. Although the matter is still debated, evidence suggests that pulse oximeters overestimate actual oxygen levels in hospitalized and intensive care unit (ICU) patients, especially at lower oxygen saturations [16]. Oxygen saturation measurements may be influenced by melanin (Fig 1), a chromophore of the skin that is present in higher concentrations in patients of darker skin tone and that affects light absorption, a key element underlying this technology [7]. This artifact appears to be the likely mechanism underpinning disparities in the performance of pulse oximeters among racial-ethnic subgroups [8,9].
In the United States, the FDA, which regulates medical devices and ensures their safety and effectiveness, requires at least 2 individuals or >15% of patients participating in trials evaluating new pulse oximeters to have "darkly pigmented skin." Yet, clear guidance on how skin pigmentation should be quantified or measured is still lacking [17]. Results from the large studies by Sjoding, Wong, and Henry [13-15] demonstrate persisting limitations, with heterogeneous mean absolute percentage errors in the estimation of blood oxygen saturation across racial-ethnic subgroups, reemphasizing that FDA requirements were insufficient to ensure appropriate calibration of pulse oximeters. In response to the publication by Sjoding and colleagues, the FDA released a warning to raise awareness among clinicians about the potential lower accuracy of pulse oximeters for patients with darker skin. However, the agency currently provides no specific recommendations to counteract such measurement biases in device evaluation or in clinical practice [18].
Several approaches might address the lack of systematic device evaluation. First, as with other medical devices, the evaluation of pulse oximeters can be improved by enrolling more patients in clinical trials and by ensuring a more diverse set of patients, i.e., with a variety of skin tones. Determining the optimal overall sample size and sociodemographic composition of a trial can be challenging. This reality warrants more methodological work to guide power analysis by anticipating effect sizes and calculating adequate population size and distribution across patient strata. Meanwhile, case studies of a few thousand patients would be an acceptable option. Beyond enhancements in study design and data collection, better representation of patients from different racial and ethnic origins should be sought. Despite guidelines from the National Institutes of Health (NIH) advocating for better representation of communities of color in clinical research [19,20] and similar commitments by the FDA Office of Minority Health and Health Equity [21], barriers to their enrollment remain at multiple levels: (a) systemic (e.g., community hospitals lacking the infrastructure to support clinical trials, despite capturing a more diverse population); (b) individual (e.g., the reluctance of healthcare professionals to register patients from underrepresented racial-ethnic communities due to an implicit bias that presumes lower adherence to assigned treatment); and (c) interpersonal (e.g., the doctor-patient relationship and the trust required for a patient to agree to join a trial). Moreover, some patients may have a historically motivated mistrust of the research enterprise, which has been associated with violations of human rights in the past [22,23]. In the United States, the All of Us program launched by the NIH in 2018 was the first step toward improved patient representation. Expected to continue for at least a decade, this study aims to collect data from over 1 million people of different racial-ethnic origins, ages, and backgrounds who live in all parts of the country [24]. Since measurement inaccuracies are thought to be more prevalent among individuals with darker skin [8,9,13-15], it may be relevant to oversample patients across a gradient of darker skin tones when recruiting for studies. Going forward, computational modeling will be key to enhancing study population design by predicting likely effect size ranges via simulations. For example, tissue-mimicking phantoms that closely reproduce the properties of human tissue can be leveraged to refine existing medical devices or propose new treatment options [25]. These bench-top methods are already used in optics to understand the optical characteristics of biological tissues, standardize bio-optical techniques, and calibrate metrics on human-like tissues before initiating a clinical trial [26]. Similarly, quantitative biology and pharmacology studies increasingly rely on interconnected microphysiological systems or organs-on-chips [27]. Using a feedback loop process, these in vitro studies can be tested experimentally, and the results of in vivo tests can be integrated in the simulation pipeline through iterative updates.
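As one illustration of the power analysis called for in this paragraph, the sketch below applies the standard normal-approximation formula for comparing mean device error between 2 equally sized groups, n = 2(z_{1-α/2} + z_{1-β})²σ²/δ² per group. The effect size δ and standard deviation σ used here are hypothetical placeholders; a real trial would need estimates from pilot data and would likely involve more than 2 strata.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients needed per group to detect a difference `delta` in mean device
    error between 2 groups (two-sided test, common standard deviation `sigma`)."""
    if delta <= 0 or sigma <= 0:
        raise ValueError("delta and sigma must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical inputs: a 1-point difference in mean SpO2 error, SD of 2 points.
n = n_per_group(delta=1.0, sigma=2.0)
n_smaller_effect = n_per_group(delta=0.5, sigma=2.0)
print(n, n_smaller_effect)
```

Halving the anticipated effect size roughly quadruples the required enrollment, which is one reason anticipating effect sizes by simulation or bench-top modeling matters before recruitment begins.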
Second, by learning from the discrepancies observed among patients of a given skin tone using paired arterial blood oxygen saturation (SaO2) and pulse oximetry (SpO2) measurements, statistical solutions could potentially be developed to de-bias raw measurements from the pulse oximeter. One way to combine these first 2 approaches could be to estimate weightings for measurement value adjustment as a function of skin tone, age, and other patient characteristics, and then apply them as part of a correction formula. To our knowledge, existing devices do not currently implement such a strategy.
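A minimal sketch of such a correction follows, assuming that the device error varies linearly with a continuous skin tone index and fitting that relationship by ordinary least squares on paired measurements. The data are synthetic, generated with an arbitrary skin-tone-dependent offset purely to exercise the code. The index, the linear form, and the omission of age and other covariates are simplifying assumptions, not a validated correction.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y ≈ intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


random.seed(0)
n = 600
# Hypothetical continuous skin tone index: 0 = lightest, 1 = darkest.
skin_tone = [random.random() for _ in range(n)]
sao2 = [random.uniform(80.0, 100.0) for _ in range(n)]  # synthetic reference values
# Synthetic device behavior (assumed): overestimation grows with the index.
spo2 = [a + 1.0 + 2.0 * t + random.gauss(0.0, 0.5) for a, t in zip(sao2, skin_tone)]

train, test = slice(0, 400), slice(400, n)

# Learn the expected device error as a function of skin tone on the training pairs.
train_errors = [p - a for p, a in zip(spo2[train], sao2[train])]
intercept, slope = fit_line(skin_tone[train], train_errors)


def corrected_spo2(raw_spo2: float, tone: float) -> float:
    """Subtract the error predicted for this skin tone from the raw reading."""
    return raw_spo2 - (intercept + slope * tone)


# Evaluate on held-out pairs.
raw_bias = fmean(p - a for p, a in zip(spo2[test], sao2[test]))
corrected_bias = fmean(
    corrected_spo2(p, t) - a for p, t, a in zip(spo2[test], skin_tone[test], sao2[test])
)
print(f"raw bias: {raw_bias:.2f}, corrected bias: {corrected_bias:.2f}")
```

Holding out part of the paired data, as done here, gives a first check that the correction generalizes beyond the patients used to fit it.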
Third, new pulse oximeter architectures are being developed to address the numerous noise sources in the photoplethysmography (PPG) waveform, which forms the basis for SpO2 measurement. The PPG signal is affected by individual user variations (e.g., skin tone, skin thickness, body mass index (BMI), age, temperature, perfusion index, and sex) and environmental perturbations (e.g., motion artifact, sensor placement). Several studies have shown the use of polarized imaging-based techniques to discriminate between light components reflected from various penetration depths to suppress skin effects and improve SpO2 accuracy [28,29]. Such architectures may reduce inaccuracies in oxygen saturation measurements and ensure similar calibration across subpopulations. Further, these 3 solutions (revised clinical trial cohort composition principles, statistical learning-based correction terms, and improved device design) could be combined to allow their respective effects to compound. All recommendations for pulse oximetry have been summarized in Table 1.

Non-contact infrared thermometers (NCITs) and temporal artery thermometers (TATs)
As a vital sign, body temperature is routinely monitored in hospital settings. It is generally used to assess health status, facilitate diagnosis, and target treatments [30,31]. Given the widespread use of non-contact infrared thermometers (NCITs) during the Coronavirus Disease 2019 (COVID-19) pandemic [32], interest in evaluating potential racial and ethnic biases in the performance of such devices has reemerged. There is a precedent for using NCITs in emergency settings. For example, NCITs have already been used to screen for fever during past epidemics, including SARS in 2003 and H1N1 in 2009 [33-35]; they are also currently recommended as a useful screening device for prevention in the FDA's COVID-19 pandemic guidelines [36]. Despite widespread adoption of NCITs in response to the COVID-19 pandemic, evidence comparing the performance of NCITs with that of devices commonly used for temperature measurement in adults is lacking. This prompted Australian researchers in May 2021 to study the difference between temperature measurements taken by NCITs and temporal artery thermometers (TATs), which are considered a gold standard device for inpatient care in Australian hospitals [31]. Both devices use infrared sensors and estimate body temperature from skin temperature measurements. However, patient characteristics such as skin tone and biological sex can affect the accuracy of temperature measurements [31].
Overall, NCITs were less precise than reference TATs, as measured by the absolute mean difference between measurement types [31] (Fig 2). Specifically, according to the Australian study, patients with light skin tone had a larger difference between body temperature estimates resulting from the 2 devices (0.27°C) than those with medium dark skin tone (0.12°C). Additionally, NCIT demonstrated a larger difference in females (0.32°C) than in males (0.21°C). In contrast with other medical devices, where inaccuracies mostly arise in darker-skinned individuals, the lack of instrument precision affecting NCITs (as estimated by the absolute mean difference with the reference measurement) was higher for light-skinned individuals.
Of clinical importance, the difference in body temperature estimates derived from the 2 thermometer types was larger when the actual body temperature was higher than 37.5°C (99.5°F). In these circumstances, using an NCIT rather than a TAT could lead to an incorrect diagnosis since most healthcare providers consider a patient to have a fever when their body temperature exceeds 38°C (100.4°F) [37]. Although these deviations may seem small, the normal body temperature only ranges from 36.16°C to 37.02°C (97.1 to 98.6°F); therefore, the observed differences based on skin tone and sex represent up to 37% of the overall healthy range of body temperatures [38]. Given this tight interval of temperature values, a deviation of 0.5°C spans more than half of this range. Similarly, in another recent retrospective study [39], the use of temporal rather than oral temperature measurements consistently yielded a lower likelihood of identifying fever in Black patients, irrespective of the considered temperature cutoff, while no such difference was found in White patients.
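The proportions quoted in this paragraph follow from dividing each reported between-device difference by the width of the normal range (37.02 − 36.16 = 0.86°C). The short check below uses only figures stated in this section; the variable and function names are ours.

```python
# Normal body temperature range quoted in the text, in degrees Celsius.
normal_low_c, normal_high_c = 36.16, 37.02
normal_width_c = normal_high_c - normal_low_c  # about 0.86


def share_of_normal_range(difference_c: float) -> float:
    """Fraction of the normal range covered by a given device difference."""
    return difference_c / normal_width_c


female_share = share_of_normal_range(0.32)      # NCIT vs. TAT difference in females
light_skin_share = share_of_normal_range(0.27)  # difference in light-skinned patients
half_degree_share = share_of_normal_range(0.5)  # a 0.5 degree deviation

for label, share in [
    ("females (0.32)", female_share),
    ("light skin tone (0.27)", light_skin_share),
    ("0.5 degree deviation", half_degree_share),
]:
    print(f"{label}: {share:.0%} of the normal range")
```

The largest reported subgroup difference (0.32°C in females) yields the 37% figure, and a 0.5°C deviation corresponds to roughly 58% of the range.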
In women, temperature fluctuations due to hormone cycles can prevent reliable comparison of temperature measurements over time and thus further complicate patient evaluation and subsequent treatment decision-making. Indeed, the luteal phase of the menstrual cycle (and high-hormone phases in women using oral contraceptives) is associated with an increase in body temperature of 0.5°C [40,41]. In parallel, prior research has shown that females show greater thermal responses to exogenous and endogenous heat loss than males, a likely cause of mismeasurements in body temperature [42]. This influence could affect the infrared energy measured by these devices. Therefore, the inaccuracy of temperature measurements for a given individual may be subject to time-varying perturbations, e.g., during different phases of the hormonal cycle. This reality could have important implications in medical practice. For example, a clinician monitoring a female patient with COVID-19 may not be able to reliably track the status of their patient from daily changes in estimated body temperature [43]. This difficulty can result from underlying device inaccuracy, hormone-induced fluctuations, or the combination of both factors, making causal interpretation challenging. In summary, a negative difference in estimated basal body temperature measurements could lead to underdiagnosis (false negative). Conversely, a positive difference could potentially lead to overdiagnosis (false positive).
Existing research suggests that the design of thermometers may have contributed to such discrepancies: the intrinsic properties of a patient's skin (e.g., tone, thickness, perspiration) [31], likely to affect estimated body temperature measurements, may not have been considered by manufacturers while calibrating the device.
Going forward, improving the sensitivity-specificity trade-off of NCITs will thus require elaborating patient-specific adjustment factors for temperature. In particular, further studies are needed to determine the influence of other patient-related factors (such as age, blood flow under the skin, metabolic rate, cardiac output, and hormonal levels) on the accuracy and reliability of temperature measurements. This step will be crucial to mitigating the thermometer's inaccuracy in women and individuals with varying skin tones. Moreover, more research should be dedicated to characterizing intersectional sources of bias, e.g., biological sex and skin tone, associated with this medical technology to remediate improper device evaluation. To date, no dataset has a sufficiently large sample size to quantify differences between temperature measurements emanating from NCITs and TATs, stratified by biological sex and skin tone. However, multiple sources of bias could compound in practice: although there is currently no evidence for this, device discrepancies associated with more pronounced thermal responses could add to those related to skin tone, yielding higher rates of inaccurate body temperature estimates, for example, among females with fever and with a lighter skin tone. Adequate documentation and reporting of the characteristics of patients enrolled in prospective studies and trials evaluating thermometers should be emphasized. Without such documentation, quantifying digital determinants of health, monitoring their temporal evolution, and correcting for addressable biases in measurements of both body temperature and other vital signs will be infeasible. All recommendations for NCIT have been summarized in Table 2.

Understanding more about the relationship between skin tone and DDoH
More generally, the issues seen in the case of pulse oximeters and NCITs may be encountered in other non-invasive medical technologies that involve light and in particular infrared light. Optical techniques routinely used in healthcare rely heavily on absorption and scattering properties of light in human tissue, which depend on the skin's thickness and the density of chromophores such as melanin, among other factors [44-46] (Fig 1). Because melanin levels determine skin tone, incorporating the latter as an adjustment factor is critical when developing medical devices. Given the complexity of skin tone gradients among patients (within and across races, ethnicities, anatomical sites [47], geographies, and cultures), using a continuous variable for skin tone seems ideal. Yet, performing this kind of measurement may be practically inconvenient and more time-consuming than self-reporting or categorical classification based on human perception, especially when the trial is sizable. Furthermore, in trial settings, transparent reporting of the composition of patient cohorts and the performance of medical devices may be more challenging if adopting a continuous scale for skin tone.
While most past trials only stratified patients into 2 groups (namely, light versus dark skin tone) due to practical or sample size considerations, more recent studies use the Fitzpatrick scale to categorize patients into 6 different skin tone categories. However, the Fitzpatrick scale was created to estimate the response to ultraviolet exposure across skin types in dermatological research and was not intended for medical device testing. Additionally, it was developed based on a patient cohort that only included Caucasians [48]. With only 2 categories capturing darker skin tones, the Fitzpatrick scale does not equally represent all racial subgroups and may thus lead to biases when validating medical devices. To address the issue of the limited number of skin tone categories, other scales have been developed, including the CIELAB system (1976), CIECAM02 (2002), and, more recently, the Monk skin tone scale [49]. Interestingly, although the 3D CIELAB framework was not specifically designed for dermatology but was instead part of a larger initiative to standardize color ordering systems, it was later found to correlate well with skin tone, with 1 parameter capturing pigmentation level and the 2 others defining chroma and hue. In contrast, the Monk scale was intentionally created by sociologist Dr. Ellis Monk in partnership with Google Research to provide a more inclusive spectrum of skin tones. Now incorporated into Google's products, it can be leveraged to enhance representation and labeling in computer vision datasets and improve the evaluation of machine learning models concerning fairness metrics.

Historically, attempts to define richer skin tone scales and colorimetric color spaces beyond dermatology had been made prior to Fitzpatrick's research. In the early 1900s, von Luschan proposed the 36-category scale subsequently used in race studies and anthropometry, while Munsell created the general-purpose 3-dimensional colorimetric color space later selected by the US government for soil and geological research. Yet, they presented some pitfalls, including the inconsistency of measurements that, at the time, were based on human perception. Building on these pioneering inventions, advancements in computer vision and recognition systems made over the past decade have alleviated the limitations of human perception and unlocked the use of broader scales. However, using richer skin tone scales and reporting medical device performance based on such a spectrum rather than on discrete categories may require larger patient study cohorts.

Furthermore, standardized skin tone scales alone are insufficient. While technologies are now available to more objectively assign a continuous numerical value corresponding to the skin tone of a given individual [50], their adoption in research and clinical practice would benefit from policies imposing the inclusion of a more representative set of skin tones in prospective studies. For example, objective skin tone measurements can be obtained using reflectance spectrometry with multiple wavelengths or cutaneous colorimeters based on general-purpose color systems such as the CIE color space. Spectrophotometers measure the transmittance and reflectance of light through a medium as a function of wavelength in all regions of the electromagnetic spectrum. In contrast, colorimeters measure absorbance, i.e., the extent to which a medium absorbs a specific color of visible light. Since cutaneous colorimeters measure the skin tone by reflecting the absorbed wavelength into trichromatic filters of red, green, and blue, subjective biases owing to arbitrary boundaries of skin tone categories set by humans can be avoided. Most importantly, the self-reporting of skin tone should be strictly limited. Instead, NIH and FDA policymakers should establish and enforce a standardized method to measure skin tones. In addition to promoting the harmonized measurement of skin tone, regulations could more closely monitor the release of medical technologies relying on light to ensure that they are not biased towards patients with specific skin tones. On the manufacturing side, for skin tone determination and other biometric measurements, making health devices more accessible and equitable across racial and ethnic subgroups will require the convergence of hardware and algorithms. Nevertheless, this would have to happen in tandem with standardization and documentation of measurement types used in research studies and prospective trials.
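As an example of how a colorimeter reading can be turned into a single continuous skin tone value, the sketch below computes the individual typology angle (ITA), a measure derived from the CIELAB lightness (L*) and yellow-blue (b*) coordinates as ITA = arctan((L* − 50)/b*) × 180/π. ITA is not discussed in this review and is offered only as one commonly used option. The readings are hypothetical, and published category cut-offs for ITA are deliberately not reproduced here.

```python
import math


def individual_typology_angle(l_star: float, b_star: float) -> float:
    """ITA in degrees from CIELAB L* and b*; higher values indicate lighter skin.

    atan2 is used so that b* = 0 does not raise a division error; for the
    positive b* values typical of skin it equals arctan((L* - 50) / b*).
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


# Hypothetical colorimeter readings (illustration only).
lighter = individual_typology_angle(l_star=70.0, b_star=15.0)
darker = individual_typology_angle(l_star=40.0, b_star=20.0)
print(f"lighter reading: {lighter:.1f} degrees, darker reading: {darker:.1f} degrees")
```

Reporting such a continuous value alongside device error would allow performance to be examined across the full gradient of skin tones and not only within a handful of perceptual categories.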

Interaction of human factors and cultural practices
However, bias in medical devices is not limited to skin tone. Human factors and cultural practices can also contribute to unexpected consequences in the performance of medical technologies.

Neurological diseases
EEG provides another example of additional sources of bias. Because the technology is inexpensive and offers high temporal resolution, it is one of the most popular techniques in neuroscience research, neurology, and sleep medicine. In particular, EEG signals are used to diagnose and inform the treatment of brain and sleep disorders and diseases such as epilepsy, seizures, brain tumors, stroke, or brain inflammation. Nonetheless, neuroscience research may suffer from unintended racial biases due partly to the use of physical electrodes in EEG [51].
Both the adhesion of electrodes to the patient's scalp and their correct placement are essential to transmit neural responses from the scalp to the EEG electrodes; suboptimal adhesion and placement can introduce noise and dampen signal amplitude [51,52]. Some studies have shown that Black hair is less likely to absorb liquid than Asian and Caucasian hair [53]. This may prevent the saline solution or conductive gel from acting as a proper conductor, thereby increasing the impedance of the sensor and ultimately reducing the quality of the neural response signals [51]. In practice, this artifact also brings cultural adaptation challenges and may result in lower participation of Black patients during data collection, as they might have to change their hairstyle (e.g., by removing cornrows or braids) to join the study [51]. Therefore, results from neuroscience studies can be more difficult to generalize to the Black patient population, and extrapolation based on their Asian or Caucasian counterparts may be problematic (Fig 3).
Recent research [54] aims to improve the design of hair clips and electrodes to ensure their compatibility with a wider variety of hairstyles. However, newer prototypes present some pitfalls: for example, they have fewer than 128 channels on the skull, the current standard in EEG research. The bias is not solely due to sensor density, as skin conductance levels are significantly lower in Black patients than in their Asian and Caucasian counterparts [55]. Beyond electrophysiology, other brain recording techniques may also be affected by racial and ethnic biases. For example, functional near-infrared spectroscopy (fNIRS) can be impaired when performed on patients with dark hair because this technology relies on shining infrared light and measuring its reflection with optrodes to quantify blood flow in the brain [56]. All recommendations for EEG have been summarized in Table 3.

Interpretation bias
Finally, even when medical devices generate unbiased data, their interpretation can be biased. The case of PFTs (spirometry) and examples taken from ophthalmology illustrate issues related to such data interpretation bias.

Pulmonary medicine and spirometry
The perception that the lungs of certain patients are inherently inferior as a function of their race and ethnicity has been built into American medical practice since Samuel Cartwright and Benjamin Gould [57]. PFTs assess lung function by measuring spirometry (i.e., airflow), lung volumes, and tissue function. Several components, including spirometry, involve correction factors, e.g., age, height, sex, and race. Normal ranges for PFTs are derived from vast samples such as the National Health and Nutrition Examination Survey (NHANES) or the Global Lung Initiative (GLI), which are supposed to be representative of the national population. Therefore, the corresponding lower and upper bounds are not patient- or subgroup-specific; instead, they only reflect aggregate measures emanating from surveys and represent population-level admissible values. Nevertheless, for such reference ranges to be useful in clinical practice, baseline populations must accurately characterize and reflect every patient. For example, although the patient cohort underlying the GLI PFT included Black Americans, the reference equations in use did not appropriately capture heterogeneity across the African continent and poorly fitted West African patients [58].
In addition to inherent racial and ethnic differences, studies suggest that socioeconomic factors and environmental exposures can significantly alter lung function. Indeed, existing biases result from both inherent and acquired factors [59]. For example, low birth weight, secondhand smoke, and air pollution are all associated with slower lung growth and lower lung function in adulthood [60,61]. Consequently, the range of normal values in the population may be affected by acquired influences such as community-level environmental and behavioral health factors, and the true normal may be higher than currently presumed (Fig 4).
However, PFT ranges are commonly used to determine the presence of disease. Because certain racial-ethnic subgroups can map to negative adjustment factors, a given low PFT value can be classified as abnormal in White patients but as normal in Black patients [61]. In addition to race and ethnicity, age plays an important role in the estimation of PFT ranges. At the transition from pediatric care to adulthood (classically 18 years of age), using both pediatric and adult reference equations on the same patient could yield results varying by −14% to 38%, making their interpretation and subsequent decision-making difficult [62]. Age can further compound with race and ethnicity: among older patients from less adequately represented subgroups, e.g., older Asian patients, GLI equations were found to be less accurate than in other subgroups [63-65]. Therefore, determinations based on a single, nonadaptive threshold may negatively affect a broad array of clinical decisions, ranging from the prescription of medications [66] to the timely recognition of occupational lung disease [67], disease workup, and management [68]. More recently, inaccurate PFT interpretations could also impact patients recovering from COVID-19 disease, from being diagnosed with secondary pulmonary fibrosis to receiving pulmonary rehabilitation [69]. All recommendations for PFTs have been summarized in Table 4.
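The classification effect described in this section can be sketched numerically. The example assumes a percent-predicted calculation with a multiplicative race-based adjustment applied to the predicted value and a fixed 80%-of-predicted cutoff. The measured value, predicted value, adjustment factor of 0.88, and cutoff are all hypothetical and chosen only for illustration; they are not taken from the NHANES or GLI reference equations.

```python
def percent_predicted(measured_l: float, predicted_l: float, adjustment: float = 1.0) -> float:
    """Measured value as a percentage of the (optionally adjusted) predicted value."""
    return 100.0 * measured_l / (predicted_l * adjustment)


def is_flagged_abnormal(
    measured_l: float,
    predicted_l: float,
    adjustment: float = 1.0,
    cutoff_percent: float = 80.0,
) -> bool:
    """Flag the result when it falls below a fixed percent-predicted cutoff."""
    return percent_predicted(measured_l, predicted_l, adjustment) < cutoff_percent


# Hypothetical patient: same measurement, same age, height, and sex.
measured_fvc_l = 3.0
predicted_fvc_l = 4.0
hypothetical_race_factor = 0.88  # assumed downward adjustment, illustration only

flagged_without_adjustment = is_flagged_abnormal(measured_fvc_l, predicted_fvc_l)
flagged_with_adjustment = is_flagged_abnormal(
    measured_fvc_l, predicted_fvc_l, adjustment=hypothetical_race_factor
)
print(flagged_without_adjustment, flagged_with_adjustment)
```

With these numbers, the same measurement is 75% of predicted and flagged as abnormal without the adjustment, but about 85% of predicted and classified as normal once the downward adjustment is applied.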

Ophthalmology: Optical coherence tomography and Humphrey visual fields
New digital technologies have transformed ophthalmology. Some are already widely used in the clinical management of many diseases, including glaucoma and retinal diseases. Such procedures have yielded rich longitudinal, high-resolution databases containing clinical, imaging, and diagnostic testing information. However, biases in the underlying data-generating processes can impact the clinical interpretation of the results and retrospective analyses conducted using these databases. Ophthalmologic technology offers 2 examples illustrative of such biases: optical coherence tomography (OCT) and Humphrey visual field (HVF) testing.
OCT technology is used to measure and map anatomical structures of the retina and optic nerve. The definition of normal and abnormal in OCT machines varies by manufacturer [70]. Each patient's measurements are compared to population-wide distributions derived from testing several hundred normal patients, usually selected from the country where the machine was manufactured. As a result, the population used to determine norms varies by machine. The characteristics of this normative population may differ from those of the population in which the machine will be used for clinical care, making the direct transfer of the technology, without intermediate recalibration, potentially biased.
Epidemiological studies show that OCT norms such as the thicknesses of the retina and nerve fiber layer (NFL) can vary by age [71,72], biological sex [73][74][75], race [72,75], and ethnicity [71,76,77]. Thinning relative to these NFL norms can be used to diagnose glaucoma and monitor progression [78]. However, a study of NFL thickness in a multiethnic Asian population revealed that it was 7.5 microns (approximately 7.7%) lower in Indian patients than in Malay or Chinese patients [79]. The authors report that this difference can adversely impact the sensitivity and specificity of glaucoma detection. They recommend refining OCT norms to reflect these ethnic, rather than racial, differences.
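To illustrate why a shift of this size matters for specificity, the following Python sketch estimates how many healthy eyes would be flagged when a fixed percentile cutoff from one normative population is applied to a population whose mean NFL thickness is 7.5 microns lower. Only the 7.5 micron and 7.7% figures come from the text above; the implied reference mean assumes the percentage is expressed relative to the comparison group, and the normal distribution, the 10 micron standard deviation, and the fifth-percentile cutoff are hypothetical assumptions made for illustration.

```python
from statistics import NormalDist

# Figures reported in the text above.
reported_difference_um = 7.5
reported_relative_difference = 0.077

# Implied comparison-group mean, assuming the percentage is relative to that group.
implied_reference_mean_um = reported_difference_um / reported_relative_difference

# Hypothetical spread of NFL thickness among healthy eyes (assumption, not from the source).
ASSUMED_SD_UM = 10.0


def healthy_fraction_flagged(reference_mean, population_mean, sd, percentile=0.05):
    """Fraction of healthy eyes in a population that fall below a cutoff
    set at the given percentile of the reference (normative) population."""
    cutoff = NormalDist(reference_mean, sd).inv_cdf(percentile)
    return NormalDist(population_mean, sd).cdf(cutoff)


# Norms applied to the population they were derived from.
matched = healthy_fraction_flagged(
    implied_reference_mean_um, implied_reference_mean_um, ASSUMED_SD_UM
)

# Same norms applied to a population whose healthy mean is 7.5 microns lower.
shifted = healthy_fraction_flagged(
    implied_reference_mean_um,
    implied_reference_mean_um - reported_difference_um,
    ASSUMED_SD_UM,
)

print(f"Implied reference mean: {implied_reference_mean_um:.1f} microns")
print(f"Healthy eyes flagged, matched norms: {matched:.1%}")
print(f"Healthy eyes flagged, shifted population: {shifted:.1%}")
```

Under these assumptions, the share of healthy eyes falling below the cutoff rises well above the nominal 5%, which is one way a mismatch between the normative and clinical populations can reduce specificity.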
Similarly, peripapillary capillary density, an OCT angiographic metric used to diagnose glaucoma, varies by race [80].Patients of African descent have a lower peripapillary capillary density than patients of European descent, thus altering the sensitivity of glaucoma detection in patients of African descent, relative to patients of European descent.
HVF testing measures the sensitivity of the visual field and reveals defects which may occur in glaucoma and many other ophthalmologic conditions. The results of HVF testing can vary by age [81], biological sex [82], and race [72,73], factors thought to be clinically relevant to diagnostic sensitivity. For example, longitudinal HVF test-retest measurements in patients with glaucoma show higher variability in patients of African descent than in those of European descent. This increased variability induces higher uncertainty in clinical decision-making, with modeling studies indicating delays in the diagnosis of glaucoma by up to 3 years among patients of African descent [73].
Addressing biases in ophthalmologic technologies will require collecting more granular sources of data from as many countries as possible to allow for input on norms and covariates from all segments of humanity [83]. All recommendations for OCT and HVF have been summarized in Table 5.

Technology | Recommendation | Strength | Level of evidence
PFTs | Clinical: Clinicians should be cautious in ensuring that the appropriate reference standard matches their patient. For example, the GLI PFT references for Black Americans may poorly fit patients from West African nations. | B | II
PFTs | Research: More research is needed to understand the balance between sufficiently relevant and granular populations to apply to the patient while being broad enough to capture a clinically relevant definition of normal ranges. | E | V
GLI, Global Lung Initiative; PFT, pulmonary function test.
https://doi.org/10.1371/journal.pdig.0000244.t004

Discussion
This review has highlighted the different sources of potential bias in selected medical diagnostics, including physical, biological, and interpretation bias. For convenience, a summary of all recommendations is provided in Table 6. Such biases can stem from flawed inclusion criteria, product design, or device validation. Improving technologies to eliminate bias needs to occur at all levels. Researchers and manufacturers must develop unbiased technologies from the initial stages of their system design, whether in the choice of wavelengths and imaging methods for tools involving the patient's skin, in the calibration of clinical equations or medical devices, or in the selection of their patient cohort. Patients, and even more so clinicians, should also be more alert to biases that exist in current technologies to avoid misinterpretation of results that may in turn lead to misdiagnosis. Furthermore, the role of industrial partners is not negligible. Health technology companies should increasingly adopt corporate social responsibility practices to incorporate ethics into their research practice and product design, thereby mitigating bias. Globally, it is crucial that organizations and policymakers, including the World Health Organization within the United Nations system, the European and African Unions, and the Association of Southeast Asian Nations (ASEAN), continue working toward universal health coverage and promoting the development of technologies that work for all types of individuals.
The feasibility of the solutions proposed in this narrative review highly depends on the countries involved. For example, regulation from the FDA could force US companies to develop less biased technologies, based on prespecified thresholds and device performance metrics, for their product to be marketed. However, applying the same regulations in Europe might be more challenging since consensus among European Union member states and coordination of the legal arsenal through the European Commission and partner organizations would be required. More importantly, developing a stronger regulatory framework can be a double-edged sword in countries with limited access to medical devices but whose populations are more prone to the biases described above.

Future work and raising awareness
This review focused on selected domains that use BioMeTs with evidence of bias. For example, we did not mention wearable devices when discussing skin tone-related biases due to a lack of consensus in the research community and the need for larger trials to inform policy [50,84]. According to a study by Bent and colleagues [85], the accuracy of PPG heart rate measurements in selected wearable devices does not differ significantly across skin tones. However, the research-grade wearable device evaluated in the study is less accurate than consumer-grade devices when patients are at rest. Further, the technology relies on light absorption and uses primarily the green wavelengths, which are also absorbed by melanin. Therefore, biases can still emerge with new or homemade wearable devices, and thus continued vigilance and evaluation are needed. Moreover, we focused on domains with clinical utility. Biases also arise in other areas of biology and medicine. For example, analyses performed in genetics often use a reference genome obtained by averaging across individuals who do not reflect the diversity of the world's population (the vast majority have European ancestry), making its application in other populations less effective [86]. Additionally, currently available genomic databases [87] still suffer from a lack of diversity, despite certain populations having higher genetic heterogeneity [88].

Grade of research:
A: Strongly recommend; good evidence.
B: Recommend; at least fair evidence.
C: No recommendation for or against; balance of benefits and harms too close to justify a recommendation.
D: Recommend against; fair evidence that it is ineffective or that harm outweighs the benefit.
E: Evidence is insufficient to recommend for or against routinely; evidence is lacking or of poor quality; benefits and harms cannot be determined.

Level of evidence:
Level I: Meta-analysis of multiple studies.
Level II: Experimental studies.
Level IV: Well-designed, non-experimental studies.
This narrative review is the first step toward identifying biases and inaccuracies in the medical technology we build and the clinical knowledge we generate. It is certainly not a complete review; instead, it should be considered as a compilation of examples illustrating how subtle omissions and considerations can result in a significant real-world impact on patients. As with pulse oximetry, research findings are often not rapidly translated into practical considerations for clinical care: studies had documented the existence of discrepancies in estimated blood oxygen saturation measured by pulse oximeters among racial-ethnic subgroups long before the topic made the headlines. In addition to faster translational science, from the lab to the bedside, we must consistently reevaluate the practice of clinical care itself and revise how health services are being delivered.

Limitations
Some of the studies discussed in this review, including research comparing body temperature measurements from NCIT and TAT devices, utilize convenience sampling methods to recruit participants. Thus, the resulting cohort of patients may not represent the target population, in which differences may ultimately be more significant than those reported in published work. In addition to limitations related to sampling methods, some studies have small sample sizes, such as the study by Foglia and colleagues with 35 patients, which may limit the strength of conclusions [2]. As evidenced by other research groups, these publications can be inappropriately used to justify the current state of affairs [16]. For example, a study with insufficient statistical power may be presumed to demonstrate accuracy when it is unable to do so. We believe small studies should only be used to identify bias, but not to rule out the existence of bias.
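The point about statistical power can be made concrete with a rough calculation. The Python sketch below uses a normal approximation to the power of a two-sided, two-group comparison of means. The effect size (a standardized difference of 0.5), the group sizes, and the 0.05 significance level are hypothetical assumptions for illustration only; they do not describe the design of any study cited in this review.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(effect_size_d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample comparison of means.

    effect_size_d is the standardized mean difference. The small probability of
    rejecting in the wrong direction is ignored, so this is a rough estimate.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size_d * sqrt(n_per_group / 2)
    return standard_normal.cdf(noncentrality - z_critical)


# Hypothetical small study: about 35 participants split into two groups of 17.
small_study_power = approximate_power(effect_size_d=0.5, n_per_group=17)

# Hypothetical larger study: 200 participants per group, same true difference.
large_study_power = approximate_power(effect_size_d=0.5, n_per_group=200)

print(f"Power with 17 per group: {small_study_power:.2f}")
print(f"Power with 200 per group: {large_study_power:.2f}")
```

Under these assumptions, the small study would miss a true moderate difference most of the time, so a null result from such a study says little about whether bias is absent.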
Practical implementation of solutions to address the existing and documented digital determinants of health affecting medical devices that are already on the market will be a phased process. Fortunately, insufficient statistical power is no longer an issue for certain technologies, e.g., pulse oximetry. Yet, in light of recent publications [13][14][15], the next challenge for these technologies is to develop equivalent performance across all populations. As new digital products are being developed, research labs and manufacturers alike should exercise caution and be proactive to mitigate the potential for bias, both by determining optimal trial sample sizes and seeking sufficient representation in patient cohorts.
There is significant debate regarding the use of studies based on observational data to identify the presence of bias. For example, in his letter to the Orange County Business Journal [16], Joe Kiani, the CEO of Masimo, states that some studies contradict internal Masimo calibration data. However, device calibration data as measured by manufacturers are often kept private and not openly accessible to external research teams investigating sources of bias affecting medical technologies. Going forward, enforcing the release of tests conducted to assess device performance would help build trust in their reliability across patient subgroups.
Finally, device manufacturers should follow data interoperability standards (e.g., DICOM, Open mHealth, HL7 FHIR), provide unrestricted downloading capabilities, and use harmonized data formatting to facilitate patient data acquisition and research that will help practitioners understand and mitigate potential biases affecting those they care for.

Recommendations
Table 6 summarizes recommendations and assesses both strength and level of evidence for all technologies reviewed in this manuscript.

Conclusion
In this review, we shed light on evidence of digital determinants of health in medical technologies and devices that do not rely on artificial intelligence and explore solutions to overcome these biases. Beyond the interaction between skin tone and light-based technologies, biases can arise from a lack of diversity in the composition of patient cohorts or from other physical characteristics. Future research should identify sources of bias left undetected to date, better document existing biases, and seek solutions to make health technologies and devices more accessible and accurate across all population groups. It is also critical to raise awareness about existing biases among researchers, practitioners, and policymakers in order to prevent the emergence of new types of bias.

Fig 1. Differential light absorption as a function of melanin levels and skin thickness. Alone or in combination, these factors can alter the performance of medical devices relying on red and/or infrared light. NCIT, non-contact infrared thermometry. https://doi.org/10.1371/journal.pdig.0000244.g001

Fig 2. This figure describes the discrepancy between NCIT and reference TAT measurements. Source: Data from [31] Khan and colleagues, Comparative accuracy testing of non-contact infrared thermometers and temporal artery thermometers in an adult hospital setting. Am J Infect Control. 2021. NCIT, non-contact infrared thermometry; TAT, temporal artery thermometer.

Fig 4. Impact of inherent factors versus exposure on lung function. The first graph notes that intrinsic lung function is affected by age and height. From birth, our lungs grow until our mid-30s, after which they slowly decline over time. Lung growth is affected by factors other than age. Height directly affects lung volume, as it ties into overall chest cavity volume. Sex similarly affects chest cavity volume. Race and ethnicity may be associated. Finally, "maximum potential" lung function is altered by exposure to air pollution, secondhand smoke, and lower respiratory tract infection, which would either reduce maximal potential (if experienced early in life) or increase the rate of decline (if experienced after maximal growth has occurred). https://doi.org/10.1371/journal.pdig.0000244.g004

Table 1. Summary for pulse oximetry recommendations.
Technology | Recommendation | Strength | Level of evidence
Pulse oximetry | Research: Prospective trials should incorporate sufficient diversity among participants to identify differential rates of deleterious outcomes, such as patient status deterioration, increased length of stay, or mortality. | E | V
Pulse oximetry | Research: Outcomes defined using pulse oximetry in research trials may need another method to measure oxygenation. | E | V
https://doi.org/10.1371/journal.pdig.0000244.t001
2019 (COVID-19) pandemic to detect fever associated with Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection

Table 2. Summary for NCIT recommendations.
Technology | Recommendation | Strength | Level of evidence
NCIT | Clinical: Use of non-contact infrared thermometers may demonstrate wider variability in temperatures above 37.5°C, among patients with lighter skin pigmentation, and among females. Correlate clinically or with another modality.

Table 5. Summary for OCT and HVF testing recommendations.
Using NFL thickness and peripapillary capillary density measurements for glaucoma diagnosis via OCT devices without norms validated in the population being studied (e.g., race, ethnicity, sex, and age) may affect sensitivity and specificity. Correlate clinically.

Table 6. Recommendations as per AHRQ levels of evidence [89].
Pulse oximetry may be less accurate, especially as oxygen saturations decrease and in patients of color. Consider confirmatory testing with arterial blood gas and potentially modifying oxygen therapy.
Clinicians should be cautious in ensuring that the appropriate reference standard matches their patient. For example, the GLI PFT references for Black Americans may poorly fit patients from West African nations.
Using nerve fiber layer thickness and peripapillary capillary density measurements for glaucoma diagnosis via OCT devices without having norms validated in the population being studied (e.g., race, ethnicity, sex, and age) may affect sensitivity and specificity. Correlate clinically.
Level V: Case reports and clinical examples.